by CaffeineOverflow.
Twitter, as an online social media platform, is a place where we relax, interact with others, and learn what's happening around the world. Over the past few years it has also been the distinctive medium through which the president of the United States communicates with the public. It is, without doubt, an indispensable part of many people's daily lives. Naturally, we'd like to figure out what one can do to make a tweet seen by more people and gain more interactions, and thus become more influential.
Here we mainly focus on analyzing the factors that covary with an original post's retweet count. The original paper analyzed information diffusion at the user and tweet levels. At the user level, it employed multilevel generalized models to predict retweetability and the retweet count received by original tweets. The results show that the numbers of followers and followees are positively associated with retweet count, while the number of reciprocal ties is negatively correlated. However, these results only tell us the direction of the association; we still have little idea of its shape. For example, does the retweet count increase linearly with the number of followers? Is there a certain "threshold effect"? We are also interested in whether posting in English, an international language, helps in getting more retweets. If so, one might choose to post in English!

At the tweet level, the paper found that the presence of hashtags is positively correlated with retweet count, while the presence of URLs and mentions is negatively correlated. This is something interesting! It suggests posting with a hashtag! One might also want to know the best time to post. The paper hints that users are more active on weekdays than on weekends, and that within a day the number of active users peaks at 8pm. But do posts published at 8pm really get the most retweets on average? Besides, Christmas is coming: can we get more retweets during holidays?
Let's explore together!
EgoTimelines: The dataset contains ego users' dynamic activities, including posting original tweets, retweeting, replying and @-mentioning. It also contains, for each tweet, the number of retweets and the presence of URLs and hashtags. The dataset is ideal for analyzing the contributions of post-level factors to a post's influence.
EgoAlterProfiles: The dataset contains sampled user profiles, including the number of followers, language, account creation time, etc. Combined with the "retweet_count" field in the EgoTimelines dataset, we can analyze the effect of the number of followers and of language (user-level factors) on a post's influence.
EgoNetwork: This dataset contains all pairs of ego-alter relationships, from which the number of followees for each ego user can be calculated and further analyzed.
import pandas as pd
import numpy as np
import numpy.random as rn
import matplotlib.pyplot as plt
import seaborn as sns
import math
import warnings
warnings.filterwarnings("ignore")
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly
from IPython.display import Image
import plotly.express as px
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
plotly.offline.init_notebook_mode() # to html
%matplotlib inline
# first read the data
ego_profiles = pd.read_csv('./EgoAlterProfiles.txt', sep='\t') # the txt file is separated by \t
ego_profiles.head()
| | ID | IsEgo | followers_count | friends_count | statuses_count | utc_offset | lang | created_at | protected |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | True | 30.0 | 28 | 9.0 | NaN | en | Wed Oct 26 03:30:32 +0000 2011 | False |
| 1 | 2 | True | 2.0 | 8 | 7.0 | -14400.0 | en | Mon Jul 27 20:46:32 +0000 2009 | False |
| 2 | 3 | True | 1.0 | 0 | 0.0 | NaN | fr | Mon Apr 23 20:57:26 +0000 2012 | False |
| 3 | 4 | True | 3.0 | 3 | 68.0 | NaN | en | Sun Feb 14 07:50:39 +0000 2010 | False |
| 4 | 5 | True | 65.0 | 118 | 748.0 | NaN | fr | Mon Jun 11 14:17:06 +0000 2012 | False |
Egos have egoIDs ranging from 1 to 34006; alters have IDs ranging from 34007 to 2516190.
# read the timeline data
# the file encoding is ISO-8859-1
# parse the datetime strings into datetime objects
ego_timeline = pd.read_csv('./EgoTimelines.txt', sep='\t', encoding='ISO-8859-1', keep_default_na=False,
                           parse_dates=[4])
ego_timeline.head(3)
| replyto_userid | retweeted_userid | id | tweetid | created_at | hashtags | urls | mentions_ids | retweet_count | egoID | retweetedUserID | replytoUserID | metionID | kind | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 398477318 | 249552537051598848 | 2012-09-22 16:55:35+00:00 | 2810549959 | 0 | 1 | NA | NA | Original | |||||
| 1 | 398477318 | 249537963145433088 | 2012-09-22 15:57:41+00:00 | 2810549959 | 0 | 1 | NA | NA | Original | |||||
| 2 | 398477318 | 129377372209299456 | 2011-10-27 02:02:23+00:00 | 0 | 1 | NA | NA | NA | Original |
# it's a csv file with \t as the separator
ego_networks = pd.read_csv('./data/EgoNetworks.txt', sep="\t")
ego_networks.head(3)
| | egoID | followerID | followeeID |
|---|---|---|---|
| 0 | 1 | 1 | 1573741 |
| 1 | 1 | 1 | 1662720 |
| 2 | 1 | 1 | 1968904 |
ego_profiles = ego_profiles[ego_profiles['IsEgo']]
(The numbers of followers and of reciprocal ties (friends) are already given in the ego profiles.)
num_followee = ego_networks.groupby('followerID').size()
ego_profiles['followees_count'] = [num_followee[ego_id] if ego_id in num_followee else 0 for ego_id in ego_profiles['ID']]
ego_profiles.head(3)
| | ID | IsEgo | followers_count | friends_count | statuses_count | utc_offset | lang | created_at | protected | followees_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | True | 30.0 | 28 | 9.0 | NaN | en | Wed Oct 26 03:30:32 +0000 2011 | False | 28 |
| 1 | 2 | True | 2.0 | 8 | 7.0 | -14400.0 | en | Mon Jul 27 20:46:32 +0000 2009 | False | 8 |
| 2 | 3 | True | 1.0 | 0 | 0.0 | NaN | fr | Mon Apr 23 20:57:26 +0000 2012 | False | 0 |
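As an aside, the list-comprehension-with-membership-test pattern used above can be written more idiomatically with `Series.map`, which looks counts up by index and leaves `NaN` for egos absent from the network. A minimal sketch on made-up toy data (the IDs are hypothetical, not from the dataset):

```python
import pandas as pd

# toy network: follower -> followee pairs (hypothetical IDs)
net = pd.DataFrame({'followerID': [1, 1, 2], 'followeeID': [10, 11, 12]})
profiles = pd.DataFrame({'ID': [1, 2, 3]})

num_followee = net.groupby('followerID').size()  # Series indexed by followerID
# look up each ego's followee count; egos with no network rows get 0
profiles['followees_count'] = profiles['ID'].map(num_followee).fillna(0).astype(int)
print(profiles['followees_count'].tolist())  # [2, 1, 0]
```

`map` keeps the result aligned with `profiles` regardless of row order, which the list comprehension only guarantees implicitly.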
Original tweets are statuses that are not replies or retweets.
# keep only the original tweets: neither replies nor retweets
# (missing values are empty strings because the file was read with keep_default_na=False)
ego_original = ego_timeline[np.logical_and(ego_timeline['replyto_userid'].apply(lambda a: len(a) == 0),
                                           ego_timeline['retweeted_userid'].apply(lambda a: len(a) == 0))].copy()
# sanity check:
# every tweet in the timeline should be sent by an ego in ego_profiles
# (note: `x in series` tests the index, not the values, so we use .isin)
assert ego_timeline['egoID'].isin(ego_profiles['ID']).all()
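One pandas pitfall worth flagging here: the `in` operator on a Series tests the index, not the values, so value-membership checks should go through `.isin`. A tiny illustration:

```python
import pandas as pd

s = pd.Series([10, 20, 30])      # default integer index 0, 1, 2
in_series = 10 in s              # False: `in` looks at the index {0, 1, 2}
in_values = s.isin([10]).any()   # True: .isin compares against the values
print(in_series, in_values)      # False True
```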
tweeted_ego_id = ego_timeline['egoID'].unique()
originalTweeted_ego_id = ego_original['egoID'].unique()
# get the number of original posts for each ego
ego_numOriginal = ego_original.groupby('egoID').size()
# get the median retweet count for each ego
ego_numRetweet = ego_original.groupby('egoID').median()['retweet_count']
# flag whether each ego has tweeted at all / has posted an original tweet
ego_profiles['tweeted'] = [ego_id in tweeted_ego_id for ego_id in ego_profiles['ID']]
ego_profiles['originalTweeted'] = [ego_id in originalTweeted_ego_id for ego_id in ego_profiles['ID']]
ego_profiles['numOriginal'] = [ego_numOriginal[ego_id] if ego_id in originalTweeted_ego_id else 0 for ego_id in ego_profiles['ID']]
ego_profiles['medianRetweet'] = [ego_numRetweet[ego_id] if ego_id in originalTweeted_ego_id else 0 for ego_id in ego_profiles['ID']]
# to reduce the influence of inactive (possibly fake) accounts, we drop users who have not posted anything
ego_tweeted_profiles = ego_profiles[ego_profiles['tweeted']]
features = ['followers_count', 'followees_count', 'friends_count']
colors_features = sns.color_palette("Set2").as_hex()[:len(features)]
# one scatter plot per factor: median retweet count vs. the factor (log x-axis)
for i, feature in enumerate(features):
    hover = [f for f in features if f != feature]
    fig = px.scatter(ego_tweeted_profiles, x=feature, y="medianRetweet",
                     size="medianRetweet", hover_data=hover)
    fig.update_layout(xaxis_type="log",
                      yaxis_title="user's median retweet_count",
                      title="How does a user's retweet count vary with their " + feature)
    fig.update_traces(marker=dict(color=colors_features[i]))
    # convert the interactive plotly figure to a static image
    img_bytes = plotly.io.to_image(fig, scale=1.5)
    display(Image(img_bytes))
First we notice that the median retweet counts are not very large: most are below 10, and only a few exceed 20. We therefore separate egos into super-influencers and normal users according to their median retweet count and analyze them separately.
# separate egos into super-influencers (median retweet count >= 20) and normal users
super_influencer = ego_tweeted_profiles[ego_tweeted_profiles['medianRetweet'] >= 20]
normal_user = ego_tweeted_profiles[ego_tweeted_profiles['medianRetweet'] < 20]
super_influencer[['ID','followers_count','followees_count','friends_count','numOriginal','medianRetweet']]
| | ID | followers_count | followees_count | friends_count | numOriginal | medianRetweet |
|---|---|---|---|---|---|---|
| 4325 | 4326 | 67.0 | 0 | 0 | 6 | 38.0 |
| 16359 | 16360 | 126.0 | 1498 | 1500 | 1 | 109.0 |
| 17158 | 17159 | 1.0 | 50 | 50 | 11 | 48.0 |
| 17821 | 17822 | 10.0 | 151 | 150 | 1 | 31.0 |
| 18669 | 18670 | 42342.0 | 18890 | 18915 | 1349 | 25.0 |
| 26428 | 26429 | 776.0 | 0 | 0 | 232 | 29.5 |
| 32949 | 32950 | 14954.0 | 14201 | 13777 | 29 | 22.0 |
Two users draw our attention here: 1. user 18670, who seems to be a famous person given the huge follower count; 2. user 17159, a user just like us, with only one follower. How did user 17159 manage to get so many retweets?
# examine user 18670 (.copy() avoids the SettingWithCopyWarning)
user18670 = ego_original[ego_original['egoID'] == 18670].copy()
user18670['has_tag'] = user18670['hashtags'].apply(lambda h: not (isinstance(h, float) and math.isnan(h)))
# examine user 17159
user17159 = ego_original[ego_original['egoID'] == 17159].copy()
user17159['has_tag'] = user17159['hashtags'].apply(lambda h: not (isinstance(h, float) and math.isnan(h)))
# plot
plt.rcParams["patch.force_edgecolor"] = True
plt.figure(figsize=(10,8))
ax0 = plt.subplot(211) # plot the comparison of user18670 and user17159's three factors
x = ['followers_count','followees_count','friends_count'] # three factors
xloc = np.arange(len(x)) # the label locations
width = 0.35 # the width of the bars
ax0.bar(xloc - width/2, super_influencer.loc[17158][x].values, width, label='17159', log=True)
ax0.bar(xloc + width/2, super_influencer.loc[18669][x].values, width, label='18670', log=True)
ax0.legend()
ax0.set_ylabel('num')
ax0.set_xticks(xloc)
ax0.set_xticklabels(x)
ax0.set_title('comparing user18670 and user17159\'s three factors')
ax1 = plt.subplot(223) # plot the retweet count distribution of user 18670
sns.histplot(data=user18670, x='retweet_count', hue="has_tag", multiple="stack",ax=ax1)
ax1.set_ylabel('number of posts')
ax1.set_title('user18670\'s retweet count')
ax2 = plt.subplot(224) # plot the retweet count distribution of user 17159
sns.histplot(data=user17159, x='retweet_count', hue="has_tag", multiple="stack",ax=ax2,)
ax2.set_ylabel('number of posts')
ax2.set_title('user17159\'s retweet count')
plt.suptitle('user18670 and user17159 activity patterns')
plt.tight_layout(rect=[0, 0.03, 1, 0.95]) # adjust the subplot geometry
# so that the sup title does not overlap with the subplot
# plt.savefig('user18670andUser17159.svg')
plt.show()
From the above analysis we realize that if you already have many followers, you might not need to do much to get retweets; but if you do not, just like user 17159, posting with hashtags is a smart choice, as it makes your posts visible to more people!
Now, for the normal users, to exclude the effects of hashtags (and also of URLs and mentions, which were shown to have certain effects in the original paper), we consider only the original posts with simple plain text.
# keep only the original posts with simple plain text
ego_original['simpleText'] = ego_original.apply(lambda y:
type(y['hashtags'])==float and math.isnan(y['hashtags'])
and type(y['urls'])==float and math.isnan(y['urls'])
and type(y['mentions_ids'])==float and math.isnan(y['mentions_ids'])
,axis=1)
ego_numRetweet_simple = ego_original[ego_original['simpleText']].groupby('egoID').median()['retweet_count']
# whether one has posted a simple plain tweet
normal_user['tweeted_simple'] = [ego_id in ego_numRetweet_simple for ego_id in normal_user['ID']]
# median of the retweet count of one's simple posts
normal_user['medianRetweet_simple'] = [ego_numRetweet_simple[ego_id] if ego_id in ego_numRetweet_simple else 0 for ego_id in normal_user['ID']]
normal_user_with_simple_posts = normal_user[normal_user['tweeted_simple']]
def _group(n):
if n == 0 or n == 1:
return 1
return np.ceil(np.log10(n))
# we group followers/ followees/ friends count with certain defined intervals
normal_user_with_simple_posts['followers_count_group'] = normal_user_with_simple_posts['followers_count'].apply(_group)
normal_user_with_simple_posts['followees_count_group'] = normal_user_with_simple_posts['followees_count'].apply(_group)
normal_user_with_simple_posts['friends_count_group'] = normal_user_with_simple_posts['friends_count'].apply(_group)
# plot the histogram of the distribution of retweet count for each interval
figure,ax = plt.subplots(5,3,figsize=(15,5),sharex=True, sharey=True)
color = sns.color_palette("tab10")
group_name = ['<=10', '(10,100]', '(100,1000]','(1000,10000]','>10000']
for groupi in range(5):
fig1=sns.violinplot(x=normal_user_with_simple_posts[normal_user_with_simple_posts['followers_count_group']==groupi+1]['medianRetweet_simple'],
color=colors_features[0], ax = ax[groupi][0])
fig2=sns.violinplot(x=normal_user_with_simple_posts[normal_user_with_simple_posts['followees_count_group']==groupi+1]['medianRetweet_simple'],
color=colors_features[1], ax = ax[groupi][1])
fig3=sns.violinplot(x=normal_user_with_simple_posts[normal_user_with_simple_posts['friends_count_group']==groupi+1]['medianRetweet_simple'],
color=colors_features[2], ax = ax[groupi][2])
if groupi == 0:
ax[groupi][0].set_title('grouped factor: followers count \n'+group_name[groupi])
ax[groupi][1].set_title('grouped factor: followees count \n'+group_name[groupi])
ax[groupi][2].set_title('grouped factor: friends count \n'+group_name[groupi])
else:
for ci in range(3):
ax[groupi][ci].set_title(group_name[groupi])
if groupi < 4:
# remove redundant xlabels
fig1.set(xlabel=None)
fig2.set(xlabel=None)
fig3.set(xlabel=None)
plt.tight_layout()
plt.suptitle('distribution of retweet counts within grouped factors')
plt.subplots_adjust(top=0.85)
plt.show()
We notice that as the followers/followees/friends counts increase (from the first row to the fifth row), the probability of receiving more retweets increases (the distribution gets "fatter"). The effect becomes noticeable especially once the count reaches three digits.
We admit that there may be strong multicollinearity among these factors, as shown in the next plot, especially between the number of followers and the number of friends. But our conclusion remains the same: by following more people and being more interactive (gaining more friends), a user can hopefully gain more followers and more retweets. Going from one digit to two, a user might not see much increase in the retweet count of a plain post, but once they reach three digits they can expect significant changes.
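The strength of that collinearity can be quantified with a correlation on the log scale. A synthetic sketch (made-up lognormal counts, not the Twitter data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
followers = rng.lognormal(mean=3, sigma=1, size=500)
# friends strongly tied to followers, up to small multiplicative noise
friends = followers * rng.lognormal(mean=0, sigma=0.2, size=500)
df = pd.DataFrame({'followers_count': followers, 'friends_count': friends})

# Pearson correlation of the log10 counts
r = np.corrcoef(np.log10(df['followers_count']), np.log10(df['friends_count']))[0, 1]
print(round(r, 2))  # close to 1, i.e. strong collinearity on the log scale
```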
fig, ax = plt.subplots(1,3, sharex=True, sharey=True, figsize=(18,6))
axi = 0 # which axes
for xlabel, ylabel in zip(['followers_count','followees_count','friends_count'],['followees_count','friends_count','followers_count']):
ax[axi].scatter(normal_user_with_simple_posts[xlabel], normal_user_with_simple_posts[ylabel], s=1,alpha=0.3)
ax[axi].set_xscale('log') # same as above, we use log as the values are skewed
ax[axi].set_xlabel(xlabel)
ax[axi].set_xlim([1,np.max(normal_user_with_simple_posts[xlabel])]) # constrain on x limit
ax[axi].set_yscale('log') # same as above, we use log as the values are skewed
ax[axi].set_ylabel(ylabel)
ax[axi].set_ylim([1,np.max(normal_user_with_simple_posts[ylabel])]) # constrain on y limit
axi += 1
plt.suptitle('collinearity between factors')
plt.show()
We further attempt to study the impact of language by treating lang='en' (English) as our "treatment".
super_influencer[['lang','ID','followers_count','followees_count','friends_count','numOriginal','medianRetweet']]
| | lang | ID | followers_count | followees_count | friends_count | numOriginal | medianRetweet |
|---|---|---|---|---|---|---|---|
| 4325 | id | 4326 | 67.0 | 0 | 0 | 6 | 38.0 |
| 16359 | en | 16360 | 126.0 | 1498 | 1500 | 1 | 109.0 |
| 17158 | th | 17159 | 1.0 | 50 | 50 | 11 | 48.0 |
| 17821 | en | 17822 | 10.0 | 151 | 150 | 1 | 31.0 |
| 18669 | en | 18670 | 42342.0 | 18890 | 18915 | 1349 | 25.0 |
| 26428 | en | 26429 | 776.0 | 0 | 0 | 232 | 29.5 |
| 32949 | es | 32950 | 14954.0 | 14201 | 13777 | 29 | 22.0 |
First we notice that among the super-influencers there are both English and non-English speakers.
Then we fit a linear regression predicting each user's median retweet count.
normal_user_with_simple_posts['english'] = normal_user_with_simple_posts['lang'].apply(lambda lang: int(lang == 'en'))
mod = smf.ols(formula='medianRetweet_simple ~ followers_count + followees_count + friends_count + statuses_count + C(english) ',
data=normal_user_with_simple_posts)
res = mod.fit()
print(res.summary())
OLS Regression Results
================================================================================
Dep. Variable: medianRetweet_simple R-squared: 0.020
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 52.59
Date: Wed, 16 Dec 2020 Prob (F-statistic): 3.27e-54
Time: 11:28:33 Log-Likelihood: -4402.8
No. Observations: 13009 AIC: 8818.
Df Residuals: 13003 BIC: 8862.
Df Model: 5
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 0.0286 0.004 6.355 0.000 0.020 0.037
C(english)[T.1] -0.0002 0.006 -0.040 0.968 -0.012 0.011
followers_count 1.584e-05 2.13e-06 7.450 0.000 1.17e-05 2e-05
followees_count 4.933e-05 0.000 0.235 0.814 -0.000 0.000
friends_count 2.262e-05 0.000 0.107 0.915 -0.000 0.000
statuses_count -4.786e-07 2.82e-07 -1.696 0.090 -1.03e-06 7.45e-08
==============================================================================
Omnibus: 31608.003 Durbin-Watson: 1.995
Prob(Omnibus): 0.000 Jarque-Bera (JB): 624692449.726
Skew: 25.255 Prob(JB): 0.00
Kurtosis: 1075.348 Cond. No. 2.52e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The coefficient for 'english' can be interpreted as the difference in the predicted median retweet count for a one-unit difference in 'english', holding the other independent variables constant. Since 'english' is a categorical variable coded as 0 or 1, a one-unit difference represents switching from one category to the other. So compared to users who do not speak English, we would expect English speakers' median retweet count to be lower by 0.0002. However, this coefficient has a p-value of 0.968, which is not statistically significant at all. The result suggests that posting in an international language such as English does not significantly affect the retweet count. (Here we assume that a user's profile language reflects the language they post in.)
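As a sanity check on this reading of a dummy coefficient: with an intercept and a single 0/1 regressor (and no other covariates), the fitted coefficient is exactly the difference in group means. A toy check with plain least squares (synthetic numbers, not our regression, which also includes the count covariates):

```python
import numpy as np

rng = np.random.default_rng(1)
y0 = rng.normal(2.0, 1.0, 1000)  # outcome for english = 0
y1 = rng.normal(2.5, 1.0, 1000)  # outcome for english = 1
y = np.r_[y0, y1]
dummy = np.r_[np.zeros(1000), np.ones(1000)]
X = np.column_stack([np.ones(2000), dummy])  # intercept + dummy

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# the dummy coefficient equals the difference in group means
print(np.isclose(beta[1], y1.mean() - y0.mean()))  # True
```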
So far we have studied this question at a coarse level; next we examine it more closely. Things might differ for each specific language, as different populations may have different usage habits.
# normalize the language by removing the part after '-'
normal_user["lang_normalized"]=normal_user["lang"].apply(lambda x:str(x).split("-")[0])
# keep only the common languages (used by more than 100 users)
lang_numusers = normal_user['lang_normalized'].value_counts() # number of users per language
lang_selected = lang_numusers[lang_numusers > 100].index
normal_user_of_certainLanguages = normal_user[normal_user['lang_normalized'].isin(lang_selected)]
# build a new data frame for the further analysis
# select the needed fields and summarize with mean statistic
data_lang=normal_user_of_certainLanguages[["lang_normalized","followers_count","friends_count","statuses_count", "followees_count","medianRetweet"]].groupby("lang_normalized").mean()
data_lang.reset_index(inplace=True)
data_lang['lang_readable']=['Arabic','German','English','Spanish','French','Indonesian','Italian','Japanese','Korean','Dutch','Portuguese','Russian','Turkish']
data_lang
| | lang_normalized | followers_count | friends_count | statuses_count | followees_count | medianRetweet | lang_readable |
|---|---|---|---|---|---|---|---|
| 0 | ar | 211.878505 | 194.623498 | 803.088117 | 193.311081 | 0.058745 | Arabic |
| 1 | de | 101.870968 | 150.225806 | 175.206452 | 150.600000 | 0.019355 | German |
| 2 | en | 105.636344 | 106.682046 | 711.406638 | 106.459304 | 0.028727 | English |
| 3 | es | 72.768288 | 122.875358 | 577.928484 | 122.639967 | 0.029832 | Spanish |
| 4 | fr | 52.528037 | 78.121495 | 441.957944 | 77.766355 | 0.021028 | French |
| 5 | id | 58.253359 | 89.600768 | 644.166987 | 89.345489 | 0.025912 | Indonesian |
| 6 | it | 37.542857 | 79.690476 | 310.280952 | 79.342857 | 0.028571 | Italian |
| 7 | ja | 105.637848 | 123.972142 | 2705.913545 | 123.537944 | 0.047550 | Japanese |
| 8 | ko | 23.278049 | 47.878049 | 254.770732 | 47.800000 | 0.017073 | Korean |
| 9 | nl | 63.635514 | 84.560748 | 1297.570093 | 84.364486 | 0.051402 | Dutch |
| 10 | pt | 129.308656 | 124.325740 | 905.523918 | 124.014806 | 0.030182 | Portuguese |
| 11 | ru | 80.333333 | 134.877333 | 280.381333 | 137.496000 | 0.005333 | Russian |
| 12 | tr | 135.216783 | 169.480186 | 332.692308 | 168.230769 | 0.032634 | Turkish |
labels = data_lang['lang_readable'].values
for col in ['medianRetweet','followers_count','followees_count','statuses_count']:
data_lang[col+'_rounded'] = round(data_lang[col],2)
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=2, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}],[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=data_lang['medianRetweet_rounded'], name="Averaged median retweet count"),
1, 1)
fig.add_trace(go.Pie(labels=labels, values=data_lang['followers_count_rounded'], name="Averaged followers count"),
2, 1)
fig.add_trace(go.Pie(labels=labels, values=data_lang['followees_count_rounded'], name="Averaged followees count"),
2, 2)
fig.add_trace(go.Pie(labels=labels, values=data_lang['statuses_count_rounded'], name="Averaged number of statuses"),
1, 2)
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name",textinfo='value')
fig.update_layout(
title_text="averaged retweet count and followers count of different language speakers",
# Add annotations in the center of the donut pies.
annotations=[dict(text='retweets', x=0.19, y=0.81, font_size=10, showarrow=False),
dict(text='statuses', x=0.81, y=0.81, font_size=10, showarrow=False),
dict(text='followees', x=0.19, y=0.19, font_size=10, showarrow=False),
dict(text='followers', x=0.81, y=0.19, font_size=10, showarrow=False)])
fig.show()
# convert the interactive plotly plot
# to a static image
img_bytes = plotly.io.to_image(fig, scale=1.5)
Image(img_bytes)
Looking at the user-language data, we can recognize some patterns in the correlation between a user's language and their retweeting activity. In particular, we can tentatively place the language groups into 3 categories, described below.
But let's not jump to conclusions; let's take a closer look! A distribution is more informative than a single summary statistic.
# three groups of languages
langs=[['ar'],['ja','nl'],['tr','ru','de']]
langs_readable = ['Arabic', 'Japanese', 'Dutch', 'Turkish', 'Russian', 'German']
ngroup = len(langs)
colors_groups = sns.color_palette().as_hex()[:ngroup]
# plot
nlang = 6
plt.rcParams["patch.force_edgecolor"] = False
fig, axes = plt.subplots(4,nlang, sharex='row',sharey='row',figsize=(15,15))
axi = 0
for gi in range(ngroup):
for li, lang in enumerate(langs[gi]):
# plot each field
sns.distplot(normal_user_of_certainLanguages[normal_user_of_certainLanguages['lang']==lang]['followers_count'],ax=axes[0][axi],color=colors_groups[gi])
sns.distplot(normal_user_of_certainLanguages[normal_user_of_certainLanguages['lang']==lang]['friends_count'],ax=axes[1][axi],color=colors_groups[gi])
sns.distplot(normal_user_of_certainLanguages[normal_user_of_certainLanguages['lang']==lang]['statuses_count'],ax=axes[2][axi],color=colors_groups[gi])
sns.distplot(normal_user_of_certainLanguages[normal_user_of_certainLanguages['lang']==lang]['medianRetweet_simple'],ax=axes[3][axi],color=colors_groups[gi])
axes[0][axi].set_title(langs_readable[axi])
# a log scale is needed, otherwise little can be read from the plot
axes[0][axi].set_yscale('log')
axes[1][axi].set_yscale('log')
axes[2][axi].set_yscale('log')
axes[3][axi].set_yscale('log')
axi += 1
plt.tight_layout(rect=[0, 0.03, 1, 0.95]) # adjust the subplot geometry
# so that the sup title does not overlap with the subplot
plt.suptitle('different language users activity patterns')
# plt.savefig('specificLanguages.svg') # save for data story
plt.show()
The fanatics: Arabic speakers have the highest average follower count, and they also have the highest retweet frequency among all language groups. (Here we assume that a high retweet count suggests these language users are more likely to retweet, and hence more likely to retweet what you have posted.) Therefore, if you happen to be an Arabic speaker, don't hesitate to post in Arabic!
The actives: Dutch and Japanese speakers share a common trait: they are enthusiastic about tweeting and retweeting (!) even though they don't necessarily have a high follower count. This may be because they focus on their favorite accounts and keep interacting with them, rather than following many people while remaining inactive. Therefore, if you are a Dutch or Japanese speaker, we would suggest posting in these languages, as your post has a higher chance of being retweeted by your community!
(Well, maybe Japanese speakers could also be categorized among "the fanatics" in terms of their activity pattern: they like following people, tweeting and retweeting! Their friend counts are also strikingly high, which suggests they like interacting with people on Twitter.)
The silent group: Turkish, Russian and German speakers are the exact opposite of the previous category. They have a high average follower count, which means they are present on the platform, but at the same time they have low tweet and retweet frequencies, which could mean they are more cautious about expressing their opinions (tweeting or retweeting what others say). Therefore, if you are a Turkish, Russian or German speaker, you might want to post in English to reach an audience that is more willing to retweet. :P
Retweet patterns may vary across tweet types, so we first determine each tweet's type and examine the distribution with a plot.
def tweet_type(x):
'''classify a tweet as Replies, RT_egos (retweet of an ego), RT_other (retweet of a non-ego), or Original'''
if x.replyto_userid != '':
return 'Replies'
elif x.retweeted_userid != '':
if x.retweetedUserID != 'NA' and int(x.retweetedUserID) < 34007:
return 'RT_egos'
else:
return 'RT_other'
else:
return 'Original'
ego_timeline['kind'] = ego_timeline.apply(tweet_type,axis=1)
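A toy check of the classifier's four branches (the IDs below are made up; recall that egos have IDs below 34007):

```python
import pandas as pd

def tweet_type(x):
    '''Replies / RT_egos / RT_other / Original, mirroring the classifier above'''
    if x.replyto_userid != '':
        return 'Replies'
    elif x.retweeted_userid != '':
        if x.retweetedUserID != 'NA' and int(x.retweetedUserID) < 34007:
            return 'RT_egos'
        return 'RT_other'
    return 'Original'

toy = pd.DataFrame({
    'replyto_userid':   ['123', '',   '',      ''],
    'retweeted_userid': ['',    '55', '99',    ''],
    'retweetedUserID':  ['NA',  '55', '99999', 'NA'],
})
print(toy.apply(tweet_type, axis=1).tolist())
# ['Replies', 'RT_egos', 'RT_other', 'Original']
```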
# define utility functions for the later grouping;
# each returns the corresponding quantile
def q90(x):
return x.quantile(0.9)
def q95(x):
return x.quantile(0.95)
def q85(x):
return x.quantile(0.85)
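Passing these helpers to `agg` produces columns named after the functions. On toy data:

```python
import pandas as pd

def q90(x):
    return x.quantile(0.9)

df = pd.DataFrame({'kind': ['a'] * 10, 'retweet_count': range(10)})
out = df.groupby('kind').agg({'retweet_count': ['mean', q90]})
# the columns form a MultiIndex: ('retweet_count', 'mean'), ('retweet_count', 'q90')
print(out[('retweet_count', 'mean')].iloc[0],
      round(out[('retweet_count', 'q90')].iloc[0], 2))  # 4.5 8.1
```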
# the interactive box plot is too memory-heavy to embed here, so we save it as HTML
fig = px.box(ego_timeline[['kind','retweet_count']], x='kind', y='retweet_count', log_y=True, points=False)
fig.write_html('./plots/sboxplot.html')

ego_timeline.groupby(['kind']).agg({'retweet_count':['mean',q85,q90,q95,'count']})
| kind | mean | q85 | q90 | q95 | count |
|---|---|---|---|---|---|
| Original | 0.243366 | 0.00 | 0.0 | 1.00 | 2514619 |
| RT_egos | 118.071429 | 41.35 | 227.9 | 650.35 | 14 |
| RT_other | 2407.404925 | 1258.00 | 2359.0 | 6196.10 | 1054979 |
| Replies | 0.074003 | 0.00 | 0.0 | 1.00 | 1132646 |
RT_egos are retweets of egos' tweets; RT_other are all other retweets. Original tweets are neither replies nor retweets. The table shows the mean, the 85th, 90th and 95th percentiles, and the tweet count of the retweet_count column.
We can see that the quantiles for original posts and replies are 0, but not for retweets: only a small fraction of tweets ever get retweeted. Note that the retweets don't have a retweet count of their own; the value just shows how many times the same original tweet has been retweeted, so we exclude retweets from our retweet-count analysis. We can also see that most retweets are of users other than egos, which makes sense since the egos are just a small sample of the whole population.
Before checking how tweeting time affects the retweet count, let's first check how Twitter usage changes over time. The original paper examined the circadian rhythms of the day and the week, so here we examine the rhythm of the year. According to the supplementary information, only egos with utc_offset information are used to produce the circadian cycles.
# fetch ego ID and utc_offset
user_utc = ego_profiles[['ID','utc_offset','lang']]
user_utc = user_utc[pd.notnull(user_utc['utc_offset'])]
# rename the column for later merge
user_utc.rename(columns = {'ID':'egoID'},inplace = True)
# fetch egoID and creation time for each tweet
tweets = ego_timeline[['egoID','created_at','retweet_count','kind']]
# merge ego and tweet information
time = tweets.merge(user_utc,on = 'egoID')
# normalize time to local_time
time['local_time'] = time['created_at'] + pd.to_timedelta(time['utc_offset'],unit = 'S')
# get day of the week,hour,month,year based on local_time
time['weekday'] = time['local_time'].dt.weekday
time['hour'] = time['local_time'].dt.hour
time['month'] = time['local_time'].dt.month
time['year'] = time['local_time'].dt.year
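To illustrate the offset arithmetic: `utc_offset` is given in seconds (e.g. -14400 is UTC-4, as in the profiles table), so adding it as a timedelta shifts the UTC timestamp to local time. A small sketch with a made-up timestamp:

```python
import pandas as pd

created = pd.Series(pd.to_datetime(['2012-09-22 16:55:35']))  # UTC timestamp
utc_offset = pd.Series([-14400])  # seconds, i.e. UTC-4

local = created + pd.to_timedelta(utc_offset, unit='s')
print(local.dt.strftime('%H:%M').iloc[0])  # 12:55
```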
Now we group by month to check whether there is any monthly pattern in Twitter usage, separating each type of tweet.
# get the number of active users and number of tweets and retweet statistics
months = time.groupby(['kind','month']).agg(
{'egoID':pd.Series.nunique,
'weekday':'count',
'retweet_count':[q90,'mean',q95,q85]})
# rename for clear usage
months.rename(columns = {'egoID':'num_users','weekday':'tweet_count'},inplace = True)
# flatten the hierarchical column index
months.columns = ['_'.join(cols).lower() for cols in months.columns.to_flat_index()]
months = months.reset_index() # keep 'kind' as an column instead of index
months.head()
| | kind | month | num_users_nunique | tweet_count_count | retweet_count_q90 | retweet_count_mean | retweet_count_q95 | retweet_count_q85 |
|---|---|---|---|---|---|---|---|---|
| 0 | Original | 1 | 1997 | 109516 | 0.0 | 0.261359 | 1.0 | 0.0 |
| 1 | Original | 2 | 2025 | 105131 | 0.0 | 0.249232 | 1.0 | 0.0 |
| 2 | Original | 3 | 2072 | 125970 | 0.0 | 0.253576 | 1.0 | 0.0 |
| 3 | Original | 4 | 2095 | 131467 | 0.0 | 0.396229 | 1.0 | 0.0 |
| 4 | Original | 5 | 2239 | 146776 | 0.0 | 0.194916 | 1.0 | 0.0 |
def px_line(df,x,y,color,y_label,name):
    '''Reusable line-plot helper (month on the x-axis).'''
    fig = px.line(df,x = x, y = y,color=color,template = 'simple_white')
    fig.update_xaxes(tick0=1, dtick=1)
    fig.update_yaxes(showgrid = False, title_text = y_label)
    fig.write_html('./plots/'+name +'.html')
    fig.show()
px_line(months,'month','tweet_count_count','kind','Number of Tweets','lineplot')
px_line(months,'month','num_users_nunique','kind','Number of Users','lineplotusers')
Please check out the HTML version for the interactive plots.
# static graph for web
fig,axes = plt.subplots(1,2,figsize=(16,8))
ylabels = ['Number of Tweets','Number of Users']
for ind,col in enumerate([months['tweet_count_count'],months['num_users_nunique']]):
sns.lineplot(ax = axes[ind],data = months,x = 'month',y = col,hue = 'kind',dashes = False,legend =(ind==1))
axes[ind].set_xticks(np.arange(13))
axes[ind].set_xlabel('Month',fontsize='large')
axes[ind].set_ylabel(ylabels[ind],fontsize='large')
plt.suptitle('Twitter Usage across Month')
plt.tight_layout()
plt.savefig('./plots/usage_month_type.svg')
plt.show()
Interesting: Twitter usage seems to rise from November through the following October, then suddenly drops to the bottom and starts climbing again. Why is that? Since the chance of getting the wrong month due to utc_offset is small, we can check the whole dataset.
This time we include the year in the plot, since not all years may share the same pattern.
# group by year and month
months_unnorm = ego_timeline.groupby([ego_timeline.created_at.dt.year,
ego_timeline.created_at.dt.month]).agg(
{'egoID':pd.Series.nunique,'id':'count'})
months_unnorm.index.names = ['year','month']
months_unnorm = months_unnorm.reset_index()
months_unnorm.rename(columns={'egoID':'num_users','id':'num_tweets'},inplace=True)
px_line(months_unnorm,'month','num_tweets','year','Number of Tweets','monthtweets')
px_line(months_unnorm,'month','num_users','year','Number of Users','monthusers')
# the static version
fig,axes = plt.subplots(1,2,figsize=(16,8))
ylabels = ['Number of Tweets','Number of Users']
for ind,col in enumerate(['num_tweets','num_users']):
sns.lineplot(ax = axes[ind],data = months_unnorm,x = 'month',y = col,hue = 'year',
dashes = False,palette='bright',legend = (ind==0))
axes[ind].set_xticks(np.arange(13))
axes[ind].set_xlabel('Month',fontsize='large')
axes[ind].set_ylabel(ylabels[ind],fontsize='large')
plt.suptitle('Twitter Usage across Month for different years')
plt.tight_layout()
plt.savefig('./plots/usage_month.png')
plt.show()
Ah-hah, it turns out there are simply more tweets in 2014 than in other years, but the authors only collected data up to November 2014. Twitter usage also increases throughout the year, most likely just because more people started using Twitter over time. We will keep in mind that Twitter usage grows over time and that the 2014 data is incomplete, and then continue our study of the factors influencing retweet count. We would expect the number of retweets to grow over time as well.
# grouping by date for all the tweets to check the growth trend
all_retweets = ego_timeline[(ego_timeline['kind'] == 'RT_egos')|(ego_timeline['kind'] == 'RT_others')]
day_unnorm = all_retweets.groupby(all_retweets.created_at.dt.date).agg({'id':'count'})
day_unnorm = day_unnorm.reset_index()  # keep the date as a separate column
day_unnorm.rename(columns={'id':'num_tweets'},inplace=True)
#Plotting the trend
fig,axe = plt.subplots(figsize=(8,8))
sns.lineplot(data = day_unnorm,x = 'created_at',y = 'num_tweets')
axe.set_xlabel('Day',fontsize='large')
axe.set_ylabel('Number of Retweets',fontsize='large')
plt.suptitle('Retweet Growth Trend')
plt.tight_layout()
plt.savefig('./plots/overall_trend.svg')
plt.show()
Clearly, they didn't collect a full final day of data :) But we can see that the overall number of retweets is trending upward.
Now let's check how tweeting time (hour and day of the week) affects retweet_count. We will focus on non_retweets (explained in step 3.1).
First, extract the non_retweets (original posts and replies) and check the distribution of retweet count.
non_retweets = time[(time.kind =='Original')|(time.kind == 'Replies')].copy()
print("The maximum of retweet count is {}, and only {:2.2%} of non_retweets have at least 1 retweet,\
only {:2.2%} of non_retweets have at least 3 retweets, and only {:2.3%} have at least 10 retweets.".format(
max(non_retweets['retweet_count']),
len(non_retweets[non_retweets.retweet_count != 0])*1./len(non_retweets),
len(non_retweets[non_retweets.retweet_count > 2])*1./len(non_retweets),
len(non_retweets[non_retweets.retweet_count > 9])*1./len(non_retweets),
))
The maximum of retweet count is 16337, and only 8.19% of non_retweets have at least 1 retweet,only 1.17% of non_retweets have at least 3 retweets, and only 0.199% have at least 10 retweets.
Since we want to find out how to maximize retweet count, below we will focus on the tweets that are retweeted at least 10 times.
most_tweeted = non_retweets[non_retweets.retweet_count > 9]
We first check out how they are distributed.
sns.displot(data = most_tweeted, x = 'retweet_count')
plt.xscale('log')
plt.yscale('log')
plt.title('log-log plot of number of tweets with certain retweet_count')
plt.tight_layout()
plt.show()
The distribution is close to a power law. Now we can finally find out whether we can maximize retweet_count by choosing the right posting time.
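A quick way to sanity-check the power-law impression is to fit the slope of the empirical CCDF on log-log axes. This is a rough sketch on synthetic Pareto data, not the tweet data itself; a rigorous fit would use maximum likelihood instead of a least-squares line:

```python
import numpy as np

# synthetic heavy-tailed sample: Pareto with alpha = 2, minimum value 10
rng = np.random.default_rng(0)
sample = (rng.pareto(2.0, 50_000) + 1) * 10

# empirical complementary CDF: P(X > x) evaluated at each sorted value
x = np.sort(sample)
ccdf = 1.0 - np.arange(len(x)) / len(x)

# a power-law tail is a straight line on log-log axes, with slope ≈ -alpha
slope, intercept = np.polyfit(np.log(x), np.log(ccdf), 1)
print(round(slope, 2))  # close to -2 for alpha = 2
```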
We group by day of the week and hour to see whether the pattern for the most-retweeted tweets differs from the overall pattern.
most_tweeted_hour = most_tweeted.groupby(['weekday','hour'])['egoID'].count()
most_tweeted_hour = most_tweeted_hour.reset_index()
most_tweeted_hour.rename(columns={'egoID':'num_tweets'}, inplace = True)
non_retweets_hour = non_retweets.groupby(['weekday','hour'])['egoID'].count()
non_retweets_hour = non_retweets_hour.reset_index()
non_retweets_hour.rename(columns={'egoID':'num_tweets'}, inplace = True)
mapping = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
most_tweeted_hour['weekday'] = most_tweeted_hour['weekday'].map(mapping)
non_retweets_hour['weekday'] = non_retweets_hour['weekday'].map(mapping)
# set figure subplot and size
fig,axes = plt.subplots(1,2,figsize=(16,8))
for ind, (col,label) in enumerate(zip(
[most_tweeted_hour,non_retweets_hour],
['Number of tweets that are retweeted at least ten times',
'Number of original tweets and replies'])):
sns.lineplot(ax = axes[ind],data = col,x = 'hour', y = 'num_tweets',
hue = 'weekday',palette='colorblind')
axes[ind].set_xticks(np.arange(24))
axes[ind].set_xlabel('Hour',fontsize='large')
axes[ind].set_ylabel(label,fontsize='large')
plt.suptitle('Comparison between the Most Retweeted Tweets and All Tweets',fontsize = 'large')
plt.tight_layout()
plt.savefig('./plots/hour_week_trend.svg')
plt.show()
fig = px.line(most_tweeted_hour,x = 'hour', y = 'num_tweets',color='weekday',
template = 'simple_white', width=915, height=400)
fig.update_xaxes(tick0=0, dtick=1)
fig.update_yaxes(showgrid = False, title_text = 'Num of Most Retweeted Original Tweets and Replies')
fig.write_html('./plots/most_tweeted.html')
fig.show()
fig = px.line(non_retweets_hour,x = 'hour', y = 'num_tweets',color='weekday',
template = 'simple_white', width=915, height=400)
fig.update_xaxes(tick0=0, dtick=1)
fig.update_yaxes(showgrid = False, title_text = 'Number of Original Tweets and Replies')
fig.write_html('./plots/non_retweets_hour.html')
fig.show()
From the figure above, we can see that the most retweeted original tweets and replies don't follow the general trend of increasing over 5-11h and 12-21h; instead, most of them are posted between 9 a.m. and 11 a.m. But does posting time actually affect retweet count? We can first check the correlation with a regression model.
mod = smf.ols(formula = 'retweet_count ~ C(hour)+C(weekday) ', data = non_retweets)
res = mod.fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: retweet_count R-squared: 0.000
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 2.576
Date: Fri, 18 Dec 2020 Prob (F-statistic): 6.73e-06
Time: 12:25:13 Log-Likelihood: -9.1996e+06
No. Observations: 2377635 AIC: 1.840e+07
Df Residuals: 2377605 BIC: 1.840e+07
Df Model: 29
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 0.1639 0.039 4.200 0.000 0.087 0.240
C(hour)[T.1] 0.0199 0.054 0.367 0.713 -0.086 0.126
C(hour)[T.2] 0.0148 0.061 0.244 0.807 -0.104 0.133
C(hour)[T.3] 0.0291 0.066 0.438 0.662 -0.101 0.159
C(hour)[T.4] 0.0425 0.070 0.605 0.545 -0.095 0.180
C(hour)[T.5] 0.0692 0.068 1.014 0.311 -0.065 0.203
C(hour)[T.6] -0.0060 0.064 -0.094 0.925 -0.132 0.120
C(hour)[T.7] -0.0084 0.058 -0.145 0.885 -0.122 0.105
C(hour)[T.8] 0.0517 0.055 0.945 0.345 -0.056 0.159
C(hour)[T.9] 0.2136 0.053 4.031 0.000 0.110 0.317
C(hour)[T.10] 0.2021 0.051 3.950 0.000 0.102 0.302
C(hour)[T.11] 0.0753 0.050 1.495 0.135 -0.023 0.174
C(hour)[T.12] 0.0002 0.049 0.004 0.997 -0.096 0.097
C(hour)[T.13] 0.0079 0.049 0.159 0.874 -0.089 0.105
C(hour)[T.14] 0.1650 0.049 3.344 0.001 0.068 0.262
C(hour)[T.15] 0.0219 0.049 0.447 0.655 -0.074 0.118
C(hour)[T.16] 0.0115 0.048 0.237 0.813 -0.084 0.107
C(hour)[T.17] -0.0004 0.048 -0.009 0.993 -0.095 0.094
C(hour)[T.18] 0.0300 0.047 0.634 0.526 -0.063 0.123
C(hour)[T.19] -0.0143 0.047 -0.308 0.758 -0.106 0.077
C(hour)[T.20] 0.0082 0.046 0.180 0.857 -0.081 0.098
C(hour)[T.21] -0.0130 0.045 -0.289 0.772 -0.101 0.075
C(hour)[T.22] -0.0081 0.045 -0.179 0.858 -0.096 0.080
C(hour)[T.23] 0.0069 0.046 0.150 0.881 -0.083 0.097
C(weekday)[T.1] 0.0058 0.028 0.208 0.835 -0.049 0.061
C(weekday)[T.2] 0.0138 0.028 0.493 0.622 -0.041 0.068
C(weekday)[T.3] -0.0042 0.028 -0.151 0.880 -0.059 0.050
C(weekday)[T.4] -0.0152 0.028 -0.541 0.589 -0.070 0.040
C(weekday)[T.5] -0.0121 0.028 -0.427 0.669 -0.068 0.044
C(weekday)[T.6] 0.0407 0.028 1.459 0.145 -0.014 0.095
==================================================================================
Omnibus: 18481933.535 Durbin-Watson: 1.969
Prob(Omnibus): 0.000 Jarque-Bera (JB): 278819780448405440.000
Skew: 1222.205 Prob(JB): 0.00
Kurtosis: 1677625.946 Cond. No. 25.0
==================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We notice that for hours 9 and 10 there is a small but significant positive correlation with retweet count. We therefore define 9 a.m. and 10 a.m. as the peak hours and further analyze the correlation under varying retweet-count thresholds.
non_retweets['peak'] = non_retweets['hour'].isin([9, 10]).astype(int)  # 1 if posted at 9 or 10 a.m.
print("The spearman coefficient between retweet count and whether a tweet is posted during 9 or 10 a.m. \
for tweets with retweet count > N is listed below: ")
for ind,N in enumerate([0,20,30,40,50,60,100,200,300]):
non_retweets_sset = non_retweets[non_retweets.retweet_count > N]
coef, p = stats.spearmanr(non_retweets_sset['retweet_count'],non_retweets_sset['peak'])
print("N = {:3d}: coefficient:{:.10f}, p-value:{:}".format(N,coef,p))
The spearman coefficient between retweet count and whether a tweet is posted during 9 or 10 a.m. for tweets with retweet count > N is listed below:
N =   0: coefficient:0.0300056407, p-value:4.863931720838995e-40
N =  20: coefficient:0.2646923894, p-value:1.0240165362540023e-32
N =  30: coefficient:0.1627751961, p-value:1.2037197543297452e-09
N =  40: coefficient:0.0579364979, p-value:0.06268678143488414
N =  50: coefficient:-0.0055401190, p-value:0.8738334333212042
N =  60: coefficient:-0.1180047818, p-value:0.002268776948644229
N = 100: coefficient:-0.1872577339, p-value:0.0013061780153343309
N = 200: coefficient:-0.2119990192, p-value:0.15724091171483665
N = 300: coefficient:-0.3738335897, p-value:0.0865542415208182
There is a small (e.g. 0.16 at N = 30) but significant (p < 0.05) positive correlation when N <= 30, which may suggest that if you have just started out on Twitter, you could try tweeting at 9 or 10 a.m.
We don't have enough data to train a decent classifier, so instead of propensity score matching, we will pick out individual users and examine whether posting time influences their retweet count.
# for each ego, count tweets with 10 <= retweet_count < 200 (capping excludes extreme outliers)
most_tweeted[most_tweeted['retweet_count']<200].groupby('egoID')['created_at'].count().sort_values()
egoID
22 1
14728 1
14733 1
14759 1
14924 1
...
10348 154
22881 226
21259 356
2604 426
18670 824
Name: created_at, Length: 380, dtype: int64
User 18670 has 824 tweets that were each retweeted at least 10 times. Let's take a closer look at his/her tweets.
# single out user 18670
u18670 = most_tweeted[most_tweeted['egoID']==18670].copy()
u18670['peak'] = u18670['hour'].isin([9, 10]).astype(int)  # 1 if posted at 9 or 10 a.m.
fig = px.violin(u18670, y="retweet_count", color="peak", box=True, points="all",
title = 'Distribution of Retweet Count for User 18670\'s Tweets',
hover_data=['retweet_count','kind','weekday','hour','month','year'],
template= 'simple_white')
fig.write_html('./plots/u18670.html')
fig.show()
Although user 18670 tweeted very often during the peak hours, posting time shows no effect on the number of retweets he/she gets. So once you already have a lot of followers (remember, this is the same super-influencer we analyzed in step 2.3), you can probably tweet whenever you want.
To analyze the effect of holidays, we choose the most celebrated holiday in English-speaking countries: Christmas. In the analysis below we keep countries where Christmas is a national public holiday; specifically, users with language en, en-gb, es, fr, nl, or pt. We exclude the year 2014, since that year's data is not complete (see step 3.2).
# keep only languages from countries that celebrate Christmas, and exclude 2014
xmas_countr = time[time['lang'].isin(['en', 'en-gb', 'es', 'fr', 'nl', 'pt'])]
xmas_countr = xmas_countr[xmas_countr['year'] != 2014]
Group by day of the year, and get the average retweet count for each day.
whole_year = xmas_countr.groupby(['month',xmas_countr.local_time.dt.day]).agg(
{'retweet_count': 'mean'})
whole_year = whole_year.reset_index()
whole_year.rename(columns={'retweet_count':'average_retweet_count'},inplace=True)
# the day-of-month column keeps the name 'local_time' after reset_index
whole_year['date'] = whole_year.apply(lambda x: str(int(x.month)) + '.' + str(int(x.local_time)), axis=1)
fig = px.line(whole_year, x='date', y='average_retweet_count',
title='Trend of Average Retweet Count per Tweet')
fig.update_xaxes(rangeslider_visible=True)
fig.update_yaxes(showgrid = False)
fig.write_html('./plots/xmas.html')
fig.show()
The average number of retweets clearly peaks on Christmas Eve, which might be a good time to tweet!